Workshop Agenda

  • Introduction to R
  • Solar production - analysis approach, packages, and environment
  • Exploratory analysis and Data Cleansing
  • Build and compare 3 solar generation models: ARIMA & 2 Generalized Linear Models
  • Forecast Generation using ARIMA
  • Show me the $$ - sensitivity analysis of savings, ROI, IRR

Before we start - Install RStudio and R

  • Windows R: https://cran.r-project.org/bin/windows/base/

  • Windows RStudio: https://rstudio.com/products/rstudio/download/

  • For Mac:

  • Open an internet browser and go to https://www.r-project.org

  • Click the “download R” link in the middle of the page under “Getting Started.”

  • Select a CRAN location (a mirror site) and click the corresponding link.

  • Click on the “Download R for (Mac) OS X” link at the top of the page.

  • Click on the file containing the latest version of R under “Files.”

  • Save the .pkg file, double-click it to open, and follow the installation instructions.

  • Now that R is installed, you need to download and install RStudio

  • Go to www.rstudio.com and click on the “Download RStudio” button.

  • Click on “Download RStudio Desktop.”

  • Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your computer, double-click it to open, and then drag and drop it to your applications folder.
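Once both are installed, it is worth verifying the setup from the RStudio console before the workshop. The package list below is an assumption based on the tools mentioned later in this agenda, not an official prerequisite list:

```r
# Print the installed R version to confirm R is working
print(R.version.string)

# Install the packages this workshop appears to rely on (list assumed
# from the agenda: tidyverse, forecast, TSstudio, plotly)
# install.packages(c("tidyverse", "forecast", "TSstudio", "plotly"))
```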

Quick survey

  • Used R?
  • Feel comfortable with linear regression?
  • Familiar with the forecast package?
  • Used ggplot2 or plotly?

Assumptions

  • Some background in R
  • Basic knowledge of probability
  • Familiar with linear regression

Quick R intro

install.packages("swirl")
library("swirl")
#swirl()

Goals

By the end of this workshop, you won’t be an expert in time series analysis and forecasting, but you will (hopefully):

  • Be more interested in exploring data science with R
  • Feel excited about applying R/Plotly/Tidyverse to analysis in your work
  • Know what it takes to build a basic time series forecasting model
  • Read my LinkedIn article

Admin

Workshop material

All of today’s slides, code, and R Markdown files are available on GitHub

Downloading the workshop material from the terminal:

git clone https://github.com/makanig/solarHomeEnergyAnalytics.git

Introduction to time series analysis

Time series analysis is commonly used in many fields of science, such as economics, finance, physics, engineering, and astronomy. Using time series to understand past events and predict future ones did not begin with the introduction of the stochastic process in the last century: ancient civilizations such as the Greeks, Romans, and Mayans studied cyclic events in weather and astronomy and learned to use them to predict future events.

Time series analysis is the art of extracting meaningful insights from time series data in order to learn about past events and to predict future ones.

This process includes the following steps:

  • Data collection - pulling the raw data from a database, API, flat files, etc.
  • Data prep - cleaning, reformatting (dates, classes, etc.), and aggregating
  • Descriptive analysis - using statistical methods and data visualization tools to extract insights and learn about the series’ components and patterns
  • Predictive analysis - leveraging the insights from the descriptive step and applying a predictive model
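As a minimal illustration of these four steps, here is a sketch using the built-in AirPassengers dataset as a stand-in for data pulled from a database or API (the model choice here is just for illustration, not the workshop’s actual approach):

```r
# 1. Data collection - a built-in dataset stands in for a database/API pull
raw <- AirPassengers

# 2. Data prep - AirPassengers is already a clean monthly ts object; real
#    data would first need date parsing, NA handling, and aggregation

# 3. Descriptive analysis - decompose into trend/seasonal/remainder
comp <- decompose(raw)
plot(comp)

# 4. Predictive analysis - a simple Holt-Winters forecast, 12 months ahead
fit  <- HoltWinters(raw)
pred <- predict(fit, n.ahead = 12)
```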

Typical time series analysis in R

A typical time series analysis process in R looks like this (picture credit: Rami & Danton):

Time series objects

There are multiple classes in R for time series data; the most common are:

  • The ts class for regular time series (with mts for multiple series) - the most common class for time series data
  • The xts and zoo classes for both regular and irregular time series, popular mainly in finance
  • The tsibble class, a tidy format for time series that supports both regular and irregular data
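A quick sketch of what constructing each of these classes looks like (the values here are random placeholders):

```r
library(xts)
library(tsibble)

# ts: a regular monthly series starting January 2018
x_ts <- ts(rnorm(24), start = c(2018, 1), frequency = 12)

# xts: the index is an explicit date vector, so irregular gaps are allowed
dates <- as.Date("2018-01-01") + c(0, 3, 7, 12)
x_xts <- xts(rnorm(4), order.by = dates)

# tsibble: a tidy data frame with a declared index column
x_tbl <- tsibble(DATE  = as.Date("2018-01-01") + 0:6,
                 value = rnorm(7), index = DATE)
```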

The TSstudio package

The TSstudio package provides a set of functions for time series analysis, including interactive data visualization tools built on the plotly engine that support multiple time series objects such as ts, xts, and zoo. The following diagram demonstrates the TSstudio workflow:

Time series data

Time series data is a sequence of values, each associated with a unique point in time, and can be divided into two groups:

  • Regular time series - a sequence of observations captured at equally spaced time intervals (e.g., every month, week, day, or hour)
  • Irregular time series - also known as unevenly spaced time series; a sequence of observations not captured at equally spaced intervals (for example, rainy days, earthquakes, or clinical trials)

Note: typically, the term time series data refers to regular time series. Therefore, unless stated otherwise, throughout the workshop the term time series (or series) refers to regular time series data
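A small base-R sketch of the distinction:

```r
# Regular: one observation per month, equally spaced, so ts() fits naturally
monthly <- ts(c(5, 7, 6, 9), start = c(2020, 1), frequency = 12)

# Irregular: event times with no fixed spacing (earthquake-style data);
# a plain date vector (or an xts/zoo object) is more natural than ts()
event_dates <- as.Date(c("2020-01-03", "2020-01-17", "2020-03-02"))
diff(event_dates)   # unequal gaps confirm the spacing is irregular
```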

Applications

With time series analysis, you can answer questions such as:

  • Approximately how many vehicles will be sold in the US in the next 12 months?
  • What will the estimated demand for natural gas in the US be over the next five years?
  • What will the demand for electricity in the UK be during the next 24 hours?

Solar Generation Forecast Analysis

What is Solar Net Energy Metering (NEM)?

What can you expect from your online PG&E Analytics

Solar Forecast Analysis workflow

This builds on the Time Series Analysis workflow slide shown earlier, adding model comparison and financial sensitivity analysis

Our solar generation data is an example of a time series

## Start of program
## Check raw data - exploratory analysis
setwd(dataDir)

# get generation
solarDf = tbl_df(read.csv(solarProductionDaily, stringsAsFactors = F))
colnames(solarDf) <- c("DATE", "energyProducedWh")

# skip the total row
solarDf = solarDf %>% mutate(DATE = as.Date(DATE,"%Y-%m-%d")) %>%   na.omit()

solarDfTsibble = solarDf %>% select(DATE, energyProducedWh) %>% as_tsibble()
ts_plot(solarDfTsibble, 
        title = "Daily Solar Generation",
        Ytitle = "Wh Daily Production",
        # Xtitle = "Date",
        slider = TRUE)

Every Data Science project needs to fix missing/bad data

  • Let’s review the code and the algorithm used to impute missing or wrong values
# Look at the fixed data
setwd(dataDir)
solarDf = getSolarDf()

solarDfTsibble = solarDf %>% select(DATE, energyProducedWh) %>% as_tsibble()
ts_plot(solarDfTsibble, 
        title = "Daily Solar Generation",
        Ytitle = "Wh Daily Production",
        # Xtitle = "Date",
        slider = TRUE)
# Get the daily temperature and precipitation
weatherDf = tbl_df(read.csv(weatherData, stringsAsFactors = F)) %>% 
  select( DATE, TMAX, TMIN, PRCP) 

weatherDf$DATE = as.Date(weatherDf$DATE,"%Y-%m-%d")
weatherDf$TEMP = (weatherDf$TMAX + weatherDf$TMIN)/2


# break up each of the attributes with gather
weatherDfGather = weatherDf %>% gather("attribute", "value", -DATE)
weatherDfGatherTs  = nest_AddMissingRows_xts(weatherDfGather, "attribute", "DATE", "value") 
weatherDfDaily = weatherDfGatherTs %>% spread(key = attribute, value=data.tsClean) %>% unnest() %>%
  mutate(DATE=tk_index(weatherDfGatherTs$data.tsClean[[1]] , timetk_idx = TRUE)) 


# include daily weather to Solar dataframe
solarDf = solarDf %>% inner_join(weatherDfDaily) %>% rowwise() %>% 
  mutate(DayLength = daylength(myLatitude, DATE))


# generate time series and plot seasonal
solarTs = ts(solarDf$energyProducedWh, start=c(solarDf$year[[1]], solarDf$yDay[[1]]), frequency = 365)
ts_seasonal(solarTs)

ARIMA parameter analysis

  • We will skip this (see Rami Krispin’s workshop for further details)
  • These plots are used to arrive at the ARIMA(2,1,1)(0,1,0) parameters
# examine the plot below for stationarity and order 


setwd(dataDir)
solarTsAdj = solarTs %>% stl(s.window='periodic')%>% seasadj() 
autoplot(solarTsAdj) # shows non-stationary, so use diff()

solarTsAdj %>% diff() %>% ggtsdisplay(main="") 

# do the same for the log transformed, bounded values
# To forecast with ARIMA, first look at the profile of the entire series; the seasonal
# decomposition shows tiny weekly fluctuations, so look for a better model
solarTsLog = ts(solarDf$energyProducedWhLog, start=c(solarDf$year[[1]], solarDf$yDay[[1]]), frequency = 365)
# comp = decompose(solarTsLog)
# plot(comp)
# # the seasonal pattern above is not too good, look for a better one


# examine the log plot below for stationarity and order 
solarTsLogAdj = solarTsLog %>% stl(s.window='periodic')%>% seasadj() 
autoplot(solarTsLogAdj) # shows non-stationary, so use diff()

solarTsLogAdj %>% diff() %>% ggtsdisplay(main="") 
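As a cross-check on orders read off ACF/PACF plots, forecast::auto.arima() can search the order space automatically. This sketch uses the built-in AirPassengers series rather than the workshop’s solar data, so the resulting orders are illustrative only:

```r
library(forecast)

# auto.arima searches (p,d,q)(P,D,Q) by AICc; stepwise = TRUE keeps the
# search fast at the cost of possibly missing the global optimum
fit <- auto.arima(AirPassengers, stepwise = TRUE, approximation = TRUE)
summary(fit)

# Compare the automatically chosen orders against the manual choice
arimaorder(fit)
```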

GLM models & Training & Test sets

setwd(dataDir)
nReadings = nrow(solarDf)

# Use this to fit a sinusoidal pattern
solarDf$xc<-cos(2*pi*solarDf$DATE_Numeric/365)
solarDf$xs<-sin(2*pi*solarDf$DATE_Numeric/365)

# Split into training and test
halfWay = floor(nReadings/2)
trainingDf = as_tibble(solarDf[1:halfWay,]) %>% na.omit()
testDf = as_tibble(solarDf[(halfWay+1):nReadings,]) %>% na.omit() # note the parentheses: halfWay+1:nReadings would mis-index
trainingLen = halfWay
testLen = nReadings - (halfWay+1) +1

Decomposition of GLM model into constituents

  • This type of analysis is used in Salesforce Einstein and similar tools
  • Understanding the underlying drivers of variance lets you act on them
setwd(dataDir)
decompose_df <- glm(energyProducedWhLog ~  xc+PRCP +DATE_Numeric*xc +
                      TMAX + TMIN, data=solarDf)

summary(decompose_df)
## 
## Call:
## glm(formula = energyProducedWhLog ~ xc + PRCP + DATE_Numeric * 
##     xc + TMAX + TMIN, data = solarDf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.57045  -0.19717   0.09251   0.27796   1.12271  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.010e+00  5.320e-01   7.538 8.28e-14 ***
## xc              -3.330e+00  6.546e-01  -5.087 4.10e-07 ***
## PRCP            -7.399e-01  6.296e-02 -11.752  < 2e-16 ***
## DATE_Numeric    -2.911e-04  2.776e-05 -10.487  < 2e-16 ***
## TMAX             9.930e-03  1.916e-03   5.182 2.50e-07 ***
## TMIN            -8.066e-03  2.586e-03  -3.119  0.00185 ** 
## xc:DATE_Numeric  8.333e-05  3.759e-05   2.217  0.02677 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1948883)
## 
##     Null deviance: 3122.33  on 1486  degrees of freedom
## Residual deviance:  288.43  on 1480  degrees of freedom
## AIC: 1797.2
## 
## Number of Fisher Scoring iterations: 2
trend <- coef(decompose_df)[1] + coef(decompose_df)['DATE_Numeric']*solarDf$DATE_Numeric
components <- cbind(
  data = solarTsLog,
  fitted = decompose_df$fitted.values,
  trend = trend,
  season = coef(decompose_df)['xc']*solarDf$xc +
    coef(decompose_df)['xc:DATE_Numeric']*solarDf$xc*solarDf$DATE_Numeric,
  prcp = coef(decompose_df)['PRCP']*solarDf$PRCP,
  TMAX = coef(decompose_df)['TMAX']*solarDf$TMAX,
  TMIN = coef(decompose_df)['TMIN']*solarDf$TMIN,
  remainder = residuals(decompose_df)
)

p = (autoplot(components, facet=TRUE) + ggtitle("GLM model components")) 
p %>% ggplotly()
RSS <- c(crossprod(decompose_df$residuals))
#Mean squared error:
MSE <- RSS / length(decompose_df$residuals)
#Root MSE:
RMSE <- sqrt(MSE)
#Pearson estimated residual variance (as returned by summary.lm):
sig2 <- RSS / decompose_df$df.residual
gof = 1 - (decompose_df$deviance/decompose_df$null.deviance)

print(paste("RMSE:", RMSE, " Pearson residual variance", sig2, " GOF:", gof))
## [1] "RMSE: 0.440421251694386  Pearson residual variance 0.194888308776893  GOF: 0.907621987482646"

Compare models using training and test sets

setwd(dataDir)
tsStationary = diff(solarTsLog,differences=1)

# Forecast with  ARIMA, use the training set for the fit
solarDfTrainingTsLog = ts(data=trainingDf$energyProducedWhLog, start=c(trainingDf$year[[1]], trainingDf$yDay[[1]]),
                          frequency=365) 


if (useRDSFiles) {
  ar_211_010 = readRDS("arModel.rds")
} else {
  ar_211_010 = Arima(solarDfTrainingTsLog, order = c(2,1,1),seasonal = c(0,1,0))
  saveRDS(ar_211_010, "arModel.rds")
}

summary(ar_211_010)
## Series: solarDfTrainingTsLog 
## ARIMA(2,1,1)(0,1,0)[365] 
## 
## Coefficients:
##          ar1     ar2      ma1
##       0.3535  0.1000  -1.0000
## s.e.  0.0513  0.0522   0.0091
## 
## sigma^2 estimated as 0.411:  log likelihood=-371.73
## AIC=751.47   AICc=751.57   BIC=767.2
## 
## Training set error measures:
##                       ME      RMSE       MAE      MPE     MAPE      MASE
## Training set -0.01495805 0.4548676 0.2319034 18.38288 50.35202 0.4301588
##                     ACF1
## Training set 0.008524465
plot(x=ar_211_010$x, y = ar_211_010$fitted)

checkresiduals(ar_211_010)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,1)(0,1,0)[365]
## Q* = 226.45, df = 146, p-value = 2.22e-05
## 
## Model df: 3.   Total lags used: 149

if (useRDSFiles) {
  aF = readRDS("aF_Files.rds")
} else {
  aF = forecast(ar_211_010,h = testLen)
  saveRDS(aF, "aF_Files.rds")
}

plot(aF$mean)

#arimaForecastValues = as_tibble(aF$mean)


# GLM Weather independent
# https://stats.stackexchange.com/questions/47840/linear-model-with-log-transformed-response-vs-generalized-linear-model-with-log

weatherIndependentGlmModel = glm(energyProducedWhLog ~ DATE_Numeric +  xc+ DATE_Numeric*xc + DATE_Numeric*xs + xs + 
                                   DATE_Numeric*dRank  + xc*dRank  + xs*dRank, 
                                 # family="gaussian"( link="log" ), 
                                 data = trainingDf)
summary(weatherIndependentGlmModel)
## 
## Call:
## glm(formula = energyProducedWhLog ~ DATE_Numeric + xc + DATE_Numeric * 
##     xc + DATE_Numeric * xs + xs + DATE_Numeric * dRank + xc * 
##     dRank + xs * dRank, data = trainingDf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.80681  -0.13427   0.07929   0.30124   0.89137  
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.510e+01  2.292e+01   1.095 0.273812    
## DATE_Numeric       -1.520e-03  1.346e-03  -1.129 0.259109    
## xc                 -9.200e+00  1.875e+01  -0.491 0.623749    
## xs                 -1.862e+00  2.377e+00  -0.783 0.433651    
## dRank              -3.129e+01  4.592e+01  -0.682 0.495762    
## DATE_Numeric:xc     4.231e-04  1.101e-03   0.384 0.700971    
## DATE_Numeric:xs     1.230e-04  1.401e-04   0.878 0.380491    
## DATE_Numeric:dRank  1.846e-03  2.698e-03   0.684 0.494083    
## xc:dRank            1.305e-01  1.219e-01   1.070 0.285029    
## xs:dRank           -5.493e-01  1.546e-01  -3.554 0.000404 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2451588)
## 
##     Null deviance: 1648.7  on 742  degrees of freedom
## Residual deviance:  179.7  on 733  degrees of freedom
## AIC: 1075.9
## 
## Number of Fisher Scoring iterations: 2
weatherIndependentGlmForecast <- stats::predict(weatherIndependentGlmModel, type = "response", newdata=testDf) %>% 
  tibble::enframe()
weatherIndependentErr <- stats::predict(weatherIndependentGlmModel, newdata=testDf, se = TRUE)





weatherBasedGlmModel = glm(energyProducedWhLog ~  xc+PRCP +DATE_Numeric*xc +
                             TMAX + TMIN, 
                           # energyProducedWhLog ~  DATE_Numeric + dRank + PRCP  +  xc + xc*dRank + DATE_Numeric*xc +
                           #   TMAX + TMIN + DATE_Numeric*xs + DATE_Numeric*dRank + xs*dRank, 
                           #family="gaussian"( link="log" ), 
                           data = trainingDf)
summary(weatherBasedGlmModel) 
## 
## Call:
## glm(formula = energyProducedWhLog ~ xc + PRCP + DATE_Numeric * 
##     xc + TMAX + TMIN, data = trainingDf)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.54237  -0.18897   0.08387   0.26581   0.99582  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.373e+00  1.494e+00   5.603 2.98e-08 ***
## xc              -2.126e+00  1.916e+00  -1.110    0.267    
## PRCP            -8.540e-01  9.416e-02  -9.069  < 2e-16 ***
## DATE_Numeric    -5.390e-04  8.407e-05  -6.412 2.57e-10 ***
## TMAX             1.408e-02  2.805e-03   5.021 6.47e-07 ***
## TMIN            -1.608e-02  3.932e-03  -4.090 4.79e-05 ***
## xc:DATE_Numeric  1.175e-05  1.125e-04   0.104    0.917    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.2044866)
## 
##     Null deviance: 1648.7  on 742  degrees of freedom
## Residual deviance:  150.5  on 736  degrees of freedom
## AIC: 938.18
## 
## Number of Fisher Scoring iterations: 2
weatherBasedGlmForecast <- stats::predict(weatherBasedGlmModel, type = "response", newdata=testDf) %>% 
  tibble::enframe()
weatherBasedErr <- stats::predict(weatherBasedGlmModel, newdata=testDf, se = TRUE)

allTestVals = tibble(DATE = testDf$DATE,
                     ArimaVals= backTransformLogValues(aF$mean),
                     WeatherIndependent =backTransformLogValues(weatherIndependentGlmForecast$value),
                     WeatherBasedSeasonal =backTransformLogValues(weatherBasedGlmForecast$value ))

dfAllTestVals = solarDf %>% dplyr::left_join(allTestVals, by = "DATE") 


r = dfAllTestVals %>%
  select(DATE, energyProducedWh, ArimaVals, WeatherIndependent,  WeatherBasedSeasonal) %>% 
  mutate(Actuals=energyProducedWh) %>% select(-energyProducedWh) %>%
  melt(id.vars = c("DATE"), variable.name="Model")

g <- ggplot(data = r, 
            aes(x=DATE, y=value)) + ggtitle("Solar generation - Model comparison") + 
  xlab(NULL) + ylab("Watt hours Daily Production") +
  #  geom_line(aes(colour=variable, group=variable)) + 
  geom_point(aes(colour=Model, shape=Model, group=Model), size=1)
g %>% ggplotly()
dfAllTestDiffs = dfAllTestVals %>% mutate(ArimaValsDiff = energyProducedWh - ArimaVals,
                                          WeatherIndependentDiff = energyProducedWh - WeatherIndependent,
                                          WeatherBasedSeasonalDiff = energyProducedWh - WeatherBasedSeasonal)
rDiff = dfAllTestDiffs %>%
  select(DATE, ArimaValsDiff, WeatherIndependentDiff,  WeatherBasedSeasonalDiff) %>% 
  melt(id.vars = c("DATE"), variable.name="Model")

g <- ggplot(data = rDiff, 
            aes(x=DATE, y=value)) + ggtitle("Solar generation - Model difference comparison") + 
  xlab(NULL) + ylab("Actual vs estimated daily Watt hrs") +
  #  geom_line(aes(colour=variable, group=variable)) + 
  geom_point(aes(colour=Model, shape=Model, group=Model), size=1)
g %>% ggplotly()
modelSummary = tibble(modelMSPE = c("ArimaMSPE","WeatherIndependentGLM_MSPE", "WeatherBasedSeasonalGLM_MSPE"),
                      modelMSPEVals = c(mean(dfAllTestDiffs$ArimaValsDiff ^ 2, na.rm =T),
                                        mean(dfAllTestDiffs$WeatherIndependentDiff ^ 2, na.rm=T),
                                        mean(dfAllTestDiffs$WeatherBasedSeasonalDiff ^ 2, na.rm=T))) %>%
  arrange(modelMSPEVals)

# print out the final model summary                
modelSummary
## # A tibble: 3 x 2
##   modelMSPE                    modelMSPEVals
##   <chr>                                <dbl>
## 1 WeatherBasedSeasonalGLM_MSPE      3896533.
## 2 WeatherIndependentGLM_MSPE        4344898.
## 3 ArimaMSPE                         6356574.

25 year ARIMA forecast

setwd(dataDir)
######
##solarTsLog is the transformed time series
# redo the ARIMA model with the complete set of generated values

if (useRDSFiles) {
  ar_FullDfModel = readRDS("ar_FullDfModel.rds")
} else {
  ar_FullDfModel  = Arima(solarTsLog, order = c(2,1,1),seasonal = c(0,1,0))
  saveRDS(ar_FullDfModel, "ar_FullDfModel.rds")
}


year_15_EndDate =  solarInstallDate %m+% years(15)
year_20_EndDate =  solarInstallDate %m+% years(20)
year_25_EndDate =  solarInstallDate %m+% years(25)

# forecast once up to the longest period and take subsets of the generated values
fLen = as.integer(year_25_EndDate - as.Date(last(solarDf$DATE)))
fLen15 = as.integer(year_15_EndDate - as.Date(last(solarDf$DATE)))
fLen20 = as.integer(year_20_EndDate - as.Date(last(solarDf$DATE)))

if (useRDSFiles) {
  ar_fullForecast = readRDS("ar_fullForecast.rds")
} else {
  ar_fullForecast = forecast(ar_FullDfModel, h = fLen)
  saveRDS(ar_fullForecast, "ar_fullForecast.rds")
}


# getting dates back from timeseries is tricky
myForecastTimes  = as.vector(time(ar_fullForecast$mean))
myForecastedDates = as.Date(format(date_decimal(myForecastTimes), "%Y-%m-%d"))

myActualTimes  = as.vector(time(solarTs))
myActualDates = as.Date(format(date_decimal(myActualTimes), "%Y-%m-%d"))

forecastDf  = tibble(DATE=myForecastedDates,
                     ArimaPredictedDailyWh= as.numeric(backTransformLogValues(ar_fullForecast$mean)))
actualDf = tibble(DATE = myActualDates,
                  ActualDailyGenerationWh = as.numeric(solarTs))

fDif1  = forecastDf %>% melt(id.vars = c("DATE"), variable.name="Attribute")
actualDf1 = actualDf %>% melt(id.vars = c("DATE"), variable.name="Attribute")

fullGenDf = rbind(actualDf1,fDif1) # %>% mutate(DATE=format(date_decimal(DATE),"%Y-%m-%d"))
# x axis gets too busy, translate after plotting

g <- ggplot(data = fullGenDf, 
            aes(x=DATE, y=value)) + ggtitle("Solar generation ARIMA forecast") + 
  xlab(NULL) + ylab("Actual and forecasted daily Watt hrs") +
  #  geom_line(aes(colour=variable, group=variable)) + 
  geom_point(aes(colour=Attribute, shape=Attribute, group=Attribute), size=1)
g %>% ggplotly()

Financial analysis

ROI & IRR

  • How do you compare differing investment returns?
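Before the scenario analysis below, a minimal sketch of the IRR idea: the IRR is the discount rate at which the net present value (NPV) of a cash-flow stream is zero. The cash-flow numbers here are hypothetical, and uniroot() stands in for whichever irr() implementation the workshop code uses:

```r
# Hypothetical cash flows: invest 10,000 up front, then save 1,500 per year
cashflows <- c(-10000, rep(1500, 10))

# NPV at annual discount rate r; period 0 is the investment itself
npv <- function(r, cf) sum(cf / (1 + r)^(seq_along(cf) - 1))

# IRR is the root of NPV(r) = 0
irr_est <- uniroot(npv, interval = c(1e-6, 1), cf = cashflows)$root
round(irr_est * 100, 1)   # annual IRR, in percent
```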
setwd(dataDir)
# Financial and TOU calculations

fullGenDf = fullGenDf  %>% mutate(month = month(DATE),
                                  year  = year(DATE))
fullGenDf$DATE = as.Date(fullGenDf$DATE)

fullGenMonthlyDf = fullGenDf %>% group_by(year, month) %>% summarize(monthlyGenKwh=sum(value)/1000) %>% 
  mutate(DATE=make_date(year, month, 1)) %>% ungroup()

# Model four price scenarios: gradual increases, a flattening rise, a doubling, and a decline
electricityCostDf1 = as_tibble(list(DATE=c(as.Date("2018-07-01"),as.Date("2019-07-01")), 
                                    costPerKwhPeak = c(0.371, 0.382),
                                    costPerKwhPartPeak = c(0.256, 0.267),
                                    costPerKwhOffPeak = c(0.18, 0.19),
                                    costProfile = "CostProfile1",
                                    model = "linear"))

electricityCostDf2 = as_tibble(list(DATE=c(as.Date("2015-07-01"), as.Date("2019-07-01"), as.Date("2025-07-01")),
                                    costPerKwhPeak = c(0.338, 0.382, 0.382),
                                    costPerKwhPartPeak = c(0.221, 0.267, 0.267),
                                    costPerKwhOffPeak = c(0.16, 0.19, 0.19),
                                    costProfile = "CostProfile2",
                                    model = "linear"))

electricityCostDf3 = as_tibble(list(DATE=c(as.Date("2018-07-01"), as.Date("2019-07-01"), as.Date("2025-07-01")),
                                    costPerKwhPeak = c(0.372, 0.382, 0.382*2),
                                    costPerKwhPartPeak = c(0.256, 0.267, 0.267*2),
                                    costPerKwhOffPeak = c(0.18, 0.19, 0.19*2),
                                    costProfile = "CostProfile3",
                                    model = "log"))

electricityCostDf4 = as_tibble(list(DATE=c(as.Date("2018-07-01"), as.Date("2019-07-01"), 
                                           as.Date("2022-07-01"), as.Date("2038-07-01")),
                                    costPerKwhPeak = c(0.372, 0.382, 0.19, 0.19),
                                    costPerKwhPartPeak = c(0.256, 0.267, 0.15, 0.15),
                                    costPerKwhOffPeak = c(0.18, 0.19, 0.11, 0.11),
                                    costProfile = "CostProfile4",
                                    model = "linear"))



costProfileList = list(electricityCostDf1,
                       electricityCostDf2,
                       electricityCostDf3,
                       electricityCostDf4)


# costProfileList = c(nest(electricityCostDf1),
#                     nest(electricityCostDf2),
#                     nest(electricityCostDf3))

genPreds = map(costProfileList,predictMonthlyGen,genMonthlyDf = fullGenMonthlyDf)
#r = rbindlist(genPreds)
# %>% group_by(costProfile) %>% arrange(DATE) %>%
#   mutate(cumMonthlySavings = cumsum(monthlySavings) - solarInvestment) %>% ungroup()

allScenarios = rbindlist(genPreds) %>% group_by(costProfile) %>% arrange(DATE) %>%
  mutate(cumMonthlySavings = cumsum(monthlySavings) - solarInvestment,
         realizedPricePerKwh = monthlySavings/monthlyGenKwh) %>% ungroup()

g1 <- ggplot(data = allScenarios,aes(x=DATE, y=cumMonthlySavings)) + 
  ggtitle("Solar investment scenarios") + 
  xlab(NULL) + ylab("Cumulative net savings") + 
  #  geom_line(aes(colour=variable, group=variable)) + 
  geom_line(aes(colour=costProfile), 
            size=1) +  theme(legend.position="none") #+ geom_text()
# geom_dl(aes(label=costProfile), method="last.points") 
#  geom_line(aes(y=solarInvestment, color='Investment')) 

g2 <- ggplot(data = allScenarios,aes(x=DATE, y=realizedPricePerKwh)) + 
  ggtitle("Solar investment scenarios") + 
  xlab(NULL) + ylab("Generated Kwh price") + # theme(legend.position="none") +
  #  geom_line(aes(colour=variable, group=variable)) + 
  geom_line(aes(colour=costProfile), 
            size=1) 
# geom_dl(aes(label=costProfile), method="last.points") 


#multiplot(g1, g2, cols=1) %>% ggplotly()

#multiplot(g1, g2, cols = 1) %>% ggplotly()

allScenariosPayback = allScenarios %>% group_by(costProfile) %>% 
  arrange(DATE) %>% filter(cumMonthlySavings > 0) %>% 
  filter(cumMonthlySavings == min(cumMonthlySavings)) %>% 
  mutate(paybackPeriodDays = DATE - solarInstallDate,
         paybackYrs = as.numeric(paybackPeriodDays)/365) %>% 
  select(costProfile, paybackYrs, DATE)

allScenariosReturnROI = allScenarios %>% group_by(costProfile) %>% 
  arrange(DATE) %>% summarise(IRR = irr(c(-solarInvestment,
                                          monthlySavings))*12*100,
                              NetReturn=last(cumMonthlySavings),
                              ROI=NetReturn/solarInvestment*100)

mySummary = allScenariosPayback %>% left_join(allScenariosReturnROI)

# round off lengthy decimal points
is.num <- sapply(mySummary, is.numeric)
mySummary[is.num] <- lapply(mySummary[is.num], round, 2)

mySummary = mySummary %>% arrange(-ROI) %>% mutate(NetReturn = sprintf("$%.0f",NetReturn),
                                                   ROI = sprintf("%.0f %%", ROI),
                                                   IRR = sprintf("%.0f %%", IRR)) 


sumPlot <- plot_ly(
  type = 'table',
  header = list(
    values = c(colnames(mySummary)),
    align = c('left', rep('center', ncol(mySummary))),
    line = list(width = 1, color = 'black'),
    fill = list(color = '#daf2ed'),
    font = list(family = "Arial", size = 14, color = "black")
  ),
  cells = list(
    values = rbind(
      t(as.matrix(unname(mySummary)))
    ),
    align = c('left', rep('center', ncol(mySummary))),
    line = list(color = "black", width = 1),
    fill = list(color = c('rgb(235, 193, 238)', 'rgba(228, 222, 249, 0.65)')),
    font = list(family = "Arial", size = 12, color = c("black"))
  ))

subplot(g1,g2, nrows=2, titleY = TRUE)
sumPlot

Summary of Generation, Usage and NEM

setwd(dataDir)

# Billing data analysis

pgeElectricBill = tbl_df(read.csv(pgeElectricBilling, stringsAsFactors = F, 
                                  skip = electricSkipLines))
pgeElectricBill$START.DATE = as.Date(pgeElectricBill$START.DATE,"%Y-%m-%d")
pgeElectricBill$END.DATE = as.Date(pgeElectricBill$END.DATE,"%Y-%m-%d")
pgeElectricBill$COST = as.numeric(sub("\\$","", pgeElectricBill$COST))

pgeElectricBill$numDays =as.integer(pgeElectricBill$END.DATE - 
                                      pgeElectricBill$START.DATE)
pgeElectricBill$avgDailyUsage  = pgeElectricBill$USAGE/pgeElectricBill$numDays
pgeElectricBill$KwhCost    =  pgeElectricBill$COST/pgeElectricBill$USAGE
pgeElectricBill = pgeElectricBill %>% select(-NOTES) %>% mutate(year = year(START.DATE),
                                                                month = month(START.DATE))

g1 = ggplot(pgeElectricBill) + aes(x = START.DATE) + geom_line(aes(y = USAGE)) +
  scale_colour_manual(values=c("red", "green")) +
  xlab(NULL) + ylab("Monthly Usage (Kwh)")  + #+
  ggtitle("Monthly Electric usage") 

g2 = ggplot(pgeElectricBill) + aes(x = START.DATE) + geom_line(aes(y = COST)) +
  xlab(NULL) + ylab("Monthly Cost($)")  + #+
  scale_colour_manual(values=c("red", "green")) +
  ggtitle("Monthly Electric usage and $$ spend") 

subplot(g1,g2,nrows=2, titleY = TRUE)
lastSolarReading = last(solarDf$DATE)


l = nrow(pgeElectricBill)
s = seq(as.Date(pgeElectricBill$START.DATE[1]), as.Date(pgeElectricBill$END.DATE[l]), "days") %>% 
  tibble::enframe(name=NULL) 
colnames(s) = c("DATE")
s = s %>% mutate(year=year(DATE),month=month(DATE)) %>% filter(DATE < lastSolarReading)
s2 = s %>% full_join(pgeElectricBill, by=c("year","month")) %>% mutate(usage=avgDailyUsage) %>%
  select(DATE,usage,KwhCost) %>% filter(!is.na(usage))

#KwhCost not useful because of extreme range of values
sDf = getSolarDf()
# merge with solarDf, the ts seems to have missing dates
s3 = s2 %>% left_join(sDf, by="DATE") %>% mutate(genKwh = energyProducedWh/1000,
                                                 netUsage = usage) %>% select(DATE, netUsage, genKwh)
s3$genKwh[s3$DATE < solarInstallDate] = 0
#s2[is.na(s2)] <- 0
s3 = s3 %>% mutate(usage=netUsage+genKwh)




# get monthly values, remove partial months
s3 <- s3 %>% 
  mutate(month = month(DATE),
         year  = year(DATE),
         DATE = floor_date(DATE, "month")) %>%
  group_by(year, month, DATE) %>%
  add_count() %>%
  summarise(monthlyUsage = sum(usage), 
            numDays = n(),
            monthlyNetUsage = sum(netUsage),
            monthlyGen = sum(genKwh)) %>%  ungroup() %>%
  filter(numDays >= 25) %>% 
  select(-numDays)


s3$context = "No Solar"
s3$context[s3$DATE > solarInstallDate] = "Solar"
s3$context[s3$DATE > electricCarDeployDate] = "Solar & Electric Car"

g1 = ggplot(s3) + aes(x = DATE) + geom_line(aes(y = monthlyUsage, colour = context)) +
  xlab(NULL) + ylab("Monthly Usage (Kwh)")  + #+
  ggtitle("Monthly Electric usage & generation") 

g2 = ggplot(s3) + aes(x = DATE) + geom_line(aes(y = monthlyGen, colour = context)) +
  xlab(NULL) + ylab("Monthly Generation (Kwh)")  + #+
  ggtitle("Monthly Electric usage & generation") 

g3 = ggplot(s3) + aes(x = DATE) + geom_line(aes(y = monthlyNetUsage, colour = context)) +
  xlab(NULL) + ylab("NEM(Kwh)")  + #+
  ggtitle("Monthly Electric usage & generation") 


subplot(g1,g2,g3,nrows=3, titleY = TRUE)